Beautiful Soup Documentation


Differences between parsers

Beautiful Soup presents the same interface to a number of different parsers, but each parser is different. Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers. Here’s a short document, parsed as HTML using the parser that comes with Python:

BeautifulSoup("<a><b /></a>", "html.parser")
# <a><b></b></a>

Since a standalone <b/> tag is not valid HTML, html.parser turns it into a <b></b> tag pair.
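Comparing against a genuine void element makes this behavior concrete. The sketch below assumes only that the bs4 package is installed. It parses a standalone <b/> and a standalone <br/> with html.parser; because <br> is a void element in HTML, it should stay self-closing while <b/> is expanded into a pair.

```python
from bs4 import BeautifulSoup

# <b> is not a void element in HTML, so the standalone tag is expanded
# into an opening/closing pair.
bold = str(BeautifulSoup("<a><b /></a>", "html.parser"))

# <br> is a void element, so it is kept as a single self-closing tag.
line_break = str(BeautifulSoup("<a><br /></a>", "html.parser"))

print(bold)
print(line_break)
```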

Here’s the same document parsed as XML (running this requires that you have lxml installed). Note that the standalone <b/> tag is left alone, and that the document is given an XML declaration instead of being put into an <html> tag:

print(BeautifulSoup("<a><b /></a>", "xml"))
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>

There are also differences between HTML parsers. If you give Beautiful Soup a perfectly formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document.

But if the document is not perfectly formed, different parsers will give different results. Here’s a short, invalid document parsed using lxml’s HTML parser. Note that the <a> tag gets wrapped in <body> and <html> tags, and the dangling </p> tag is simply ignored:

BeautifulSoup("<a></p>", "lxml")
# <html><body><a></a></body></html>

Here’s the same document parsed using html5lib:

BeautifulSoup("<a></p>", "html5lib")
# <html><head></head><body><a><p></p></a></body></html>

Instead of ignoring the dangling </p> tag, html5lib pairs it with an opening <p> tag. html5lib also adds an empty <head> tag; lxml didn’t bother.

Here’s the same document parsed with Python’s built-in HTML parser:

BeautifulSoup("<a></p>", "html.parser")
# <a></a>

Like lxml, this parser ignores the closing </p> tag. Unlike html5lib or lxml, this parser makes no attempt to create a well-formed HTML document by adding <html> or <body> tags.

Since the document “<a></p>” is invalid, none of these techniques is the ‘correct’ way to handle it. The html5lib parser uses techniques that are part of the HTML5 standard, so it has the best claim on being the ‘correct’ way, but all three techniques are legitimate.
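To see these differences side by side on your own machine, a small comparison script can help. This is a minimal sketch: parse_with_each is a hypothetical helper name, not part of Beautiful Soup, and it assumes that lxml and html5lib may or may not be installed, so a missing parser is recorded as None rather than treated as an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: serialized tree}, or None for a missing parser.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            results[name] = None
    return results


results = parse_with_each("<a></p>")
for name, tree in results.items():
    shown = tree if tree is not None else "(not installed)"
    print(f"{name:12} {shown}")
```

Only html.parser ships with Python, so it is the only entry guaranteed to produce a tree here.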

Differences between parsers can affect your script. If you’re planning on distributing your script to other people, or running it on multiple machines, you should specify a parser in the BeautifulSoup constructor. That will reduce the chances that your users parse a document differently from the way you parse it.
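One way to follow this advice is to name the parser in a single place and route every parse through it. The sketch below is an assumption about how a script might be organized, not a prescribed pattern: PARSER and make_soup are hypothetical names, and html.parser is chosen only because it needs no extra installation.

```python
from bs4 import BeautifulSoup

# Hypothetical module-level constant: the one place the parser is chosen.
PARSER = "html.parser"


def make_soup(markup):
    """Parse markup with the explicitly chosen parser (hypothetical helper)."""
    return BeautifulSoup(markup, PARSER)


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.b.get_text())
```

If the script relies on lxml or html5lib instead, that package then has to be declared as a dependency so that every machine running the script has it available.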
